An introduction to pandas

This post is not a tutorial but a collection of notes that I took while learning pandas. I will try to keep it updated as I learn more. I want to keep it short and simple, so it mentions only the features that I use most.

What is pandas?

"pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license." (Source: Wikipedia)

Pandas has become the de facto standard for data analysis in Python. It is a powerful library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of NumPy and uses Matplotlib for its plotting features.

Installation

A good practice is to create a virtual environment for each project. This allows you to install the required packages without affecting the global Python installation. You can use the following commands to create and activate a conda environment:

conda create -n pandas_intro python=3.10
conda activate pandas_intro

Pandas is available on PyPI and can be installed using pip:

pip install pandas

Personally, I prefer using pandas in JupyterLab, a web application that lets you create and run Jupyter notebooks.

JupyterLab can be installed with pip:

pip install jupyterlab

Or using the conda package:

conda install -c conda-forge jupyterlab

To launch JupyterLab, run the following from the command line:

jupyter-lab

After starting JupyterLab, you can open a new notebook and start writing your code in your web browser.

Importing pandas

A common practice is to import pandas as pd:

import pandas as pd

Pandas data frames

A pandas data frame is a powerful data structure that allows you to work with tabular data. Data frames are similar to the tables in relational databases.

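Here is an example of a data frame created from a list of rows. Each inner list becomes one row, so the strings 'A', 'B' and 'C' end up as data in the first row and not as column names:

df = pd.DataFrame([
  ['A', 'B', 'C'],
  [1, 2, 3],
  [4, 5, 6]
])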

The data frame has three columns and three rows.

df.shape # (3, 3): three rows, three columns

Here is an example of a data frame created from a dictionary:

df = pd.DataFrame({
  'A': [1, 4],
  'B': [2, 5],
  'C': [3, 6]
})
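
The two ways of building a data frame can be compared with a small sketch. It assumes that the column names are passed through the columns parameter when the data comes as a list of rows. The variable names are only illustrative.

```python
import pandas as pd

# A list of rows plus explicit column names.
rows = [[1, 2, 3], [4, 5, 6]]
df_from_rows = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# The same data as a dictionary that maps column names to column values.
df_from_dict = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

print(df_from_rows.shape)                 # (2, 3)
print(df_from_rows.equals(df_from_dict))  # True
```

Both forms give the same two-row data frame, so the choice mostly depends on how the data arrives.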

                

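A data frame can also be created from an Excel file. By default, read_excel returns a single data frame built from the first sheet:

df = pd.read_excel('data.xlsx')

When sheet_name=None is passed, the result is a dictionary of data frames. The keys of the dictionary are the sheet names.

sheets = pd.read_excel('data.xlsx', sheet_name=None)
sheets['Sheet1']

for name, df_sheet in sheets.items():
  print(name, df_sheet.shape)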

Data frame indexing and slicing

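Data frames can be indexed by row and column, either by position with iloc or by label with loc.

df.iloc[0] # row at position 0
df.iloc[0, 0] # row at position 0, column at position 0
df.loc[0] # row with label 0
df.loc[0, 'A'] # row with label 0, column A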

I personally prefer to use the loc indexer because it is more intuitive.

df.loc[0, 'A'] # row 0, column A
df.loc[0, ['A', 'B']] # row 0, columns A and B

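A data frame can be sliced by row and column. Note that iloc excludes the end of the slice, while loc includes the end label:

df.iloc[0:2] # rows at positions 0 and 1
df.iloc[0:2, 0:2] # rows at positions 0 and 1, columns at positions 0 and 1
df.loc[0:2] # rows with labels 0, 1 and 2
df.loc[0:2, 'A'] # rows with labels 0, 1 and 2, column A
df.loc[0:2, ['A', 'B']] # rows with labels 0, 1 and 2, columns A and B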
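
The difference between the two kinds of slicing is easy to miss, so here is a minimal sketch on a three-row data frame. The sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8]})

by_position = df.iloc[0:2]  # positions 0 and 1, the end position is excluded
by_label = df.loc[0:2]      # labels 0, 1 and 2, the end label is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```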

                

Data frame basic operations

In this section, we will see how to perform basic operations on data frames.

Adding rows and columns

To add a new row at the end of the data frame, you can use pd.concat (the older append method was removed in pandas 2.0):

new_row = pd.DataFrame([{
  'A': 7,
  'B': 8,
  'C': 9
}])
df = pd.concat([df, new_row], ignore_index=True) # ignore_index=True to reset the index

To add a new column, you can use the assign method, which returns a new data frame:

df = df.assign(D=[10, 11, 12])

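To add a new column at a specific position, you can use the insert method. It modifies the data frame in place and returns None, so its result should not be assigned back to df:

df.insert(0, 'E', [13, 14, 15]) # insert column E at position 0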
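
Putting the three operations together, a minimal sketch could look like this. It assumes pandas 2.0 or later, where rows are added with pd.concat, and the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

# Add a row by concatenating a one-row data frame.
new_row = pd.DataFrame([{'A': 7, 'B': 8, 'C': 9}])
df = pd.concat([df, new_row], ignore_index=True)

# assign returns a new data frame with the extra column at the end.
df = df.assign(D=[10, 11, 12])

# insert changes df in place and returns None.
result = df.insert(0, 'E', [13, 14, 15])

print(result)            # None
print(list(df.columns))  # ['E', 'A', 'B', 'C', 'D']
print(df.shape)          # (3, 5)
```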

                

Deleting rows and columns

The drop method can be used to delete rows and columns. It returns a new data frame and leaves the original unchanged:

df.drop(0) # delete the row with label 0
df.drop(columns=['A']) # delete column A

Sometimes you may want to delete rows and columns that contain missing values.

df.dropna(axis=0, how='any') # delete rows that contain missing values
df.dropna(axis=1, how='any') # delete columns that contain missing values

To drop a column that contains only missing values:

df.dropna(axis=1, how='all') # delete columns in which every value is missing

To drop duplicate rows:

df.drop_duplicates(keep='first', inplace=True) # delete duplicate rows except the first one (inplace=True modifies the data frame)
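
To see how these calls behave together, here is a small sketch on a data frame with made-up values: one column is entirely empty, one row has a missing value, and two rows are identical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, 1.0, np.nan],
    'B': [2.0, 2.0, 3.0],
    'C': [np.nan, np.nan, np.nan],
})

# Column C holds only missing values, so how='all' removes it.
no_empty_cols = df.dropna(axis=1, how='all')

# Row 2 has a missing value in column A, so how='any' removes it.
complete_rows = no_empty_cols.dropna(axis=0, how='any')

# Rows 0 and 1 are identical, so only the first one is kept.
unique_rows = complete_rows.drop_duplicates(keep='first')

print(list(no_empty_cols.columns))  # ['A', 'B']
print(list(complete_rows.index))    # [0, 1]
print(list(unique_rows.index))      # [0]
```

None of these calls uses inplace=True here, so every step returns a new data frame and df itself stays untouched.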

                

Editing data

To edit a single cell, you can use the at accessor:

df.at[0, 'A'] = 0 # replace data in row 0, column A with 0

Replacing data

To replace data in a data frame, you can use the replace method, which returns a new data frame:

df.replace(to_replace=1, value=0) # replace 1 with 0

For a single column, another way to replace data is to use the map method. Be aware that map turns every value that is missing from the dictionary into NaN:

df['A'].map({1: 0, 4: 7}) # replace 1 with 0 and 4 with 7
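
The difference between map and replace shows up as soon as a value has no entry in the dictionary. A minimal sketch, with a made-up series and mapping:

```python
import pandas as pd

s = pd.Series([1, 4, 9], name='A')
mapping = {1: 0, 4: 7}

mapped = s.map(mapping)        # 9 has no entry in the mapping, so it becomes NaN
replaced = s.replace(mapping)  # 9 has no entry in the mapping, so it is kept

print(mapped.isna().tolist())  # [False, False, True]
print(replaced.tolist())       # [0, 7, 9]
```

So replace is the safer choice when only some of the values should change.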

                

To replace substrings in a column of strings, you can use the str.replace method:

df['A'] = df['A'].str.replace('A', 'B') # replace every A with B

String operations

For string operations, you can use the methods of the str accessor. Each line below is an independent example:

df['A'] = df['A'].str.lower() # convert to lowercase
df['A'] = df['A'].str.upper() # convert to uppercase
df['A'] = df['A'].str.strip() # remove leading and trailing whitespace
df['A'] = df['A'].str.split(' ') # split by space (each value becomes a list)
df['A'] = df['A'].str.replace(' ', '_') # replace space with underscore
df['A'] = df['A'].str.capitalize() # capitalize first letter
df['A'] = df['A'].str.replace(r'(\d+)', r'(\1)', regex=True) # wrap each run of digits in parentheses using regex
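
Here is a short sketch of a few of these methods applied to a made-up series of strings. The names and values are only illustrative.

```python
import pandas as pd

names = pd.Series(['  ada lovelace ', 'GRACE hopper', 'room 42'])

# Methods of the str accessor return a new Series, so they can be chained.
cleaned = names.str.strip().str.lower()

slugs = cleaned.str.replace(' ', '_')
wrapped = cleaned.str.replace(r'(\d+)', r'(\1)', regex=True)
parts = cleaned.str.split(' ')

print(cleaned.tolist())  # ['ada lovelace', 'grace hopper', 'room 42']
print(slugs.tolist())    # ['ada_lovelace', 'grace_hopper', 'room_42']
print(wrapped.tolist())  # ['ada lovelace', 'grace hopper', 'room (42)']
print(parts.tolist())    # [['ada', 'lovelace'], ['grace', 'hopper'], ['room', '42']]
```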

                

Filtering rows and columns

A data frame can be filtered by a condition expressed as a boolean expression (a mask).

mask = df['A'] > 1
masked_df = df[mask]
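
Masks can also be combined, and loc can filter rows and columns in one step. A minimal sketch with made-up values (note the parentheses around each condition, which are required when using & and |):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Keep the rows where A > 1 and B < 8.
mask = (df['A'] > 1) & (df['B'] < 8)

# Keep only columns A and C of the matching rows.
filtered = df.loc[mask, ['A', 'C']]

print(filtered.to_dict(orient='records'))  # [{'A': 4, 'C': 6}]
```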

Data frame sorting

df.sort_values(by='A') # sort by column A
df.sort_values(by=['A', 'B']) # sort by columns A and B
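
The sort direction can be set per column with the ascending parameter. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2], 'B': [5, 9, 7]})

# Sort by A in ascending order, then by B in descending order.
sorted_df = df.sort_values(by=['A', 'B'], ascending=[True, False])

print(sorted_df.index.tolist())  # [1, 2, 0]
print(df.index.tolist())         # [0, 1, 2], the original is unchanged
```

sort_values returns a new data frame and keeps the original index labels, which is why the printed index looks shuffled.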

Data frame merging

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df1.merge(df2, on='A') # merge on column A (the two B columns become B_x and B_y)
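
By default merge performs an inner join, so only the keys found in both data frames survive. The sketch below compares it with a left join. The column names key, name and score are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'score': [20, 30, 40]})

# Inner join (the default): only keys 2 and 3 are in both data frames.
inner = left.merge(right, on='key')

# Left join: every row of left is kept, and missing scores become NaN.
left_join = left.merge(right, on='key', how='left')

print(inner['key'].tolist())               # [2, 3]
print(left_join['key'].tolist())           # [1, 2, 3]
print(left_join['score'].isna().tolist())  # [True, False, False]
```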

                

Data frame aggregation

The groupby method can be used to group data by a column. In this example, the data is grouped by column A and the mean of every other column is calculated for each group:

df.groupby('A').mean() # group by column A and compute the mean of the other columns
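
A minimal sketch of grouping on made-up data, showing both the mean of every column and an aggregation restricted to a single column:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x', 'y'],
    'B': [1, 3, 10],
    'C': [2.0, 4.0, 6.0],
})

# Mean of every remaining column for each value of A.
means = df.groupby('A').mean()

# Select one column before aggregating to get a Series.
b_sum = df.groupby('A')['B'].sum()

print(means.loc['x', 'B'])  # 2.0
print(means.loc['y', 'C'])  # 6.0
print(b_sum.to_dict())      # {'x': 4, 'y': 10}
```

The grouping column becomes the index of the result, which is why the values are looked up with loc['x', ...].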

                

Data frame concatenation

Concatenation stacks data frames on top of each other. It works best when the data frames have the same columns. A column that is missing from one of them is filled with NaN.

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
pd.concat([df1, df2]) # concatenate df1 and df2
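
When the columns do not match exactly, concat still works and fills the gaps. A small sketch with made-up data frames, also using ignore_index=True to get a fresh index:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5], 'C': [6]})

# The result has the union of the columns, with NaN where a column was missing.
stacked = pd.concat([df1, df2], ignore_index=True)

print(list(stacked.columns))          # ['A', 'B', 'C']
print(list(stacked.index))            # [0, 1, 2]
print(stacked['B'].isna().tolist())   # [False, False, True]
```

Without ignore_index=True the original index labels are kept, so the result would contain the label 0 twice.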

                

Data frame statistics

Statistics

The describe method can be used to calculate basic statistics:

df.describe()

Correlation

Correlation is a measure of the strength of the relationship between two variables. Intuitively, it tells how closely one variable follows the other, on a scale from -1 to 1.

The corr method computes the correlation between every pair of columns and returns the result as a matrix:

df.corr()
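
As a quick sanity check, here is a sketch with made-up columns where the expected correlations are obvious: B grows with A, and C shrinks as A grows.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [2, 4, 6, 8],  # B = 2 * A
    'C': [4, 3, 2, 1],  # C = 5 - A
})

corr = df.corr()

print(corr.shape)                   # (3, 3)
print(round(corr.loc['A', 'B'], 6)) # 1.0
print(round(corr.loc['A', 'C'], 6)) # -1.0
```

A single value can be read from the matrix with loc, for example corr.loc['A', 'B'].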

                

Data frame plotting

Pandas uses Matplotlib to plot data frames. Data frames can be plotted using the plot method:

df.plot(kind='scatter', x='A', y='B')

Data frame conversion

A data frame can be converted to a multitude of formats. Personally I convert data frames to dictionaries and lists.

Converting to a dictionary

To convert a data frame to a dictionary, you can use the to_dict method:

df.to_dict()

Exporting to a JSON file

To export a data frame to a JSON file, you can use the to_json method:

with open('data.json', 'w') as f:
  df.to_json(f)

If the data frame contains non-ASCII characters, you can set the force_ascii parameter to False so that they are written as-is and not as escape sequences:

with open('data.json', 'w', encoding='utf-8') as f:
  df.to_json(f, force_ascii=False)
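
The effect of force_ascii is easier to see without a file. When no path or file object is passed, to_json returns the JSON as a string. A minimal sketch with a made-up city column:

```python
import json

import pandas as pd

df = pd.DataFrame({'city': ['Zürich']})

escaped = df.to_json()                    # force_ascii=True is the default
readable = df.to_json(force_ascii=False)  # non-ASCII characters are kept as-is

print('Zürich' in escaped)   # False
print('Zürich' in readable)  # True

# Both strings decode to the same data.
print(json.loads(escaped) == json.loads(readable))  # True
```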

Converting to a list of dictionaries

To convert a data frame to a list of dictionaries, you can use the to_dict method, setting the orient parameter to records:

df.to_dict(orient='records')

Converting to a list of lists

To convert a data frame to a list of lists, you can use the values attribute:

df.values.tolist()
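
To compare the three conversions side by side, here is a small sketch on a made-up two-row data frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5]})

as_dict = df.to_dict()                     # column -> {index label -> value}
as_records = df.to_dict(orient='records')  # one dictionary per row
as_lists = df.values.tolist()              # one list per row, without column names

print(as_dict)     # {'A': {0: 1, 1: 4}, 'B': {0: 2, 1: 5}}
print(as_records)  # [{'A': 1, 'B': 2}, {'A': 4, 'B': 5}]
print(as_lists)    # [[1, 2], [4, 5]]
```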

                
